Consistent estimation of the architecture of 
multilayer perceptrons 

Joseph Rynkiewicz 

Université Paris I - SAMOS/MATISSE 
90 rue de Tolbiac, Paris - France 

Abstract. We consider regression models involving multilayer perceptrons
(MLP) with one hidden layer and Gaussian noise. The parameters of the MLP
can be estimated by maximizing the likelihood of the model. In this
framework, it is difficult to determine the true number of hidden units
using an information criterion, like the Bayesian information criterion
(BIC), because the Fisher information matrix is not invertible if the
number of hidden units is overestimated. Indeed, the classical theoretical
justification of information criteria relies entirely on the invertibility
of this matrix. However, using recent methodology introduced to deal with
models with a loss of identifiability, we prove that a suitable information
criterion leads to consistent estimation of the true number of hidden units.

1 Introduction 

Feed-forward neural networks are well-known and popular tools for dealing
with nonlinear statistical models. We can describe an MLP regression model
as a parametric family of probability density functions. If the noise of the
regression model is Gaussian then it is well known (see Watanabe and
Fukumizu [5]) that the maximum likelihood estimator is equal to the
least-squares estimator. Therefore, it is natural to consider the Gaussian
likelihood when we study feed-forward neural networks from the statistical
viewpoint. H. White [10] reviews learning in MLP in detail from the
statistical viewpoint. However, he left open an important question: the
asymptotic behavior of the estimator when the MLP in use has redundant
hidden units and the Fisher information matrix is singular. Fukumizu [1]
gives an answer in the case of unbounded parameters. In this case, the
behavior of the maximum likelihood estimator (MLE) is very different from
the classical case, since the likelihood ratio statistic can have an order
lower bounded by $O(\log(n))$, with $n$ the number of observations. This
result indicates that the BIC (see Schwarz [5]) is no longer consistent.

However, it is also natural to assume that the parameters are bounded.
Indeed, computer calculations always assume this boundedness. Moreover, it
is safe practice to bound the parameters in order to avoid numerical
problems.

The main result of this paper is to show that if we assume that the
parameters are in a suitable compact set (i.e. bounded and closed), the
likelihood ratio is tight, so the BIC is consistent. To obtain this result
we use recent techniques introduced by Liu and Shao [6] and Gassiat [3].
These techniques consist in finding a parameterization separating the
identifiable part of the parameter vector from the unidentifiable part. We
can then obtain an asymptotic expansion of the likelihood of the model,
which allows us to show that a set of generalized score functions is a
Donsker class. Finally, using a theorem of Gassiat [3], we conclude that
suitable information criteria like the BIC are consistent because the
likelihood ratio statistic is tight.



2 The model 



2.1 Unidentifiability of the true regression function 

Let $x = (x_1, \dots, x_d)^T \in \mathbb{R}^d$ be the vector of inputs. The
MLP function with $k$ hidden units can be written:

$$F_\theta(x) = \beta + \sum_{i=1}^{k} a_i \phi(b_i + w_i^T x)$$

with $\theta = (\beta, a_1, \dots, a_k, b_1, \dots, b_k, w_{11}, \dots, w_{1d}, \dots, w_{kd}) \in \mathbb{R}^{2k+1+k \times d}$
the parameter vector of the model, and $w_i := (w_{i1}, \dots, w_{id})^T$.
The transfer function $\phi$ is assumed to be bounded and three times
differentiable. We also assume that the first, second and third derivatives
of $\phi$, namely $\phi'$, $\phi''$ and $\phi'''$, are bounded. We consider
that the data $(X_t, Y_t)_{t \in \mathbb{N}^*}$ are random variables
satisfying the equation:

$$Y_t = F_{\theta^0}(X_t) + \varepsilon_t \qquad (1)$$

where $(\varepsilon_t)_{t \in \mathbb{N}^*}$ is a sequence of independent and
identically distributed (i.i.d.) $\mathcal{N}(0, \sigma^2)$ variables. Note
that the true model (1) is assumed to belong to the considered set of
parameters $\Theta$. Define the true number of hidden units as the smallest
integer $k^0$ such that there exists

$$\theta^0 = (\beta^0, a_1^0, \dots, a_{k^0}^0, b_1^0, \dots, b_{k^0}^0, w_{11}^0, \dots, w_{1d}^0, \dots, w_{k^0 d}^0) \in \mathbb{R}^{2k^0+1+k^0 \times d}$$

with $F_{\theta^0}$ equal to the true regression function of model (1). If we
overestimate the true number of hidden units, then the true parameter is
unidentifiable, that is to say it belongs to a union of finitely many
submanifolds of $\Theta$, and the dimension of at least one of these
manifolds is larger than zero. For example, suppose we have a multilayer
perceptron with two hidden units and the true function $F_{\theta^0}(x)$ is
given by a perceptron with only one hidden unit, say
$F_{\theta^0}(x) = a_1^0 \phi(w_1^{0T} x)$. Then any parameter of the set:

$$\{w_1 = w_1^0,\ a_1 = a_1^0,\ \beta = b_1 = a_2 = 0\} \cup \{w_1 = w_2 = w_1^0,\ a_1 + a_2 = a_1^0,\ \beta = b_1 = b_2 = 0\}$$

realizes the function $F_{\theta^0}(x)$. In this framework, the likelihood
ratio statistic does not follow the usual chi-square asymptotics, which
require uniqueness of the true parameter among their regularity conditions.
Another difficulty arises if some $w_i$ is equal to zero, because the
function $\phi(b_i + w_i^T x)$ is then constant, like $\beta$. In order to
avoid this source of unidentifiability we constrain the set of parameters so
that, for some $\eta > 0$, every $w_i$ of a parameter in $\Theta$ satisfies
$\|w_i\| \geq \eta$.
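The two-hidden-unit example above can be checked numerically. The following is a minimal sketch, assuming a tanh transfer function and arbitrary illustrative values for $w_1^0$ and $a_1^0$ (the function name `mlp` and all numbers are hypothetical, not taken from the paper). It evaluates one parameter from each of the two families and compares it with the one-unit function.

```python
import numpy as np

def mlp(x, beta, a, b, w, phi=np.tanh):
    """F_theta(x) = beta + sum_i a_i * phi(b_i + w_i^T x) for a batch x of shape (n, d)."""
    x = np.atleast_2d(x)
    hidden = phi(x @ np.asarray(w).T + np.asarray(b))  # shape (n, k)
    return beta + hidden @ np.asarray(a)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))          # five inputs in R^3
w0 = np.array([0.5, -1.0, 2.0])      # hypothetical true weight vector w_1^0
a0 = 1.5                             # hypothetical true output weight a_1^0

# True function: one hidden unit, beta = b_1 = 0.
f_true = mlp(x, 0.0, [a0], [0.0], [w0])

# First family: the second unit is switched off (a_2 = 0); b_2 and w_2 are free.
f_a = mlp(x, 0.0, [a0, 0.0], [0.0, 7.0], [w0, [3.0, 3.0, 3.0]])

# Second family: both units copy w_1^0 and their output weights sum to a_1^0.
f_b = mlp(x, 0.0, [0.4, a0 - 0.4], [0.0, 0.0], [w0, w0])

same_a = np.allclose(f_true, f_a)
same_b = np.allclose(f_true, f_b)
```

Both comparisons hold for any choice of the free quantities, which is exactly why the set of parameters realizing the true function has positive dimension.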



2.2 Likelihood of the model 



Let us consider the sample $Z_i = (X_i, Y_i)$, where $X_i$ and $Y_i$ follow
the probability law induced by model (1). We assume that the law of $X_i$ is
$q(x)\lambda_d(x)$, with $\lambda_d$ the Lebesgue measure on $\mathbb{R}^d$
and $q(x)$ a density function which is strictly positive for all
$x \in \mathbb{R}^d$. The likelihood of the observation $z := (x, y)$ for a
parameter vector $\theta$ is written:

$$f_\theta(z) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(y - F_\theta(x))^2} q(x)$$

For the sake of simplicity and concision we assume that $\sigma^2$ is known
and fixed, but it is not hard to relax this assumption. We also assume that
the true number of hidden units is known to be smaller than $M$. $M$ can be
very large (for example 1000000), so this assumption is not restrictive in
practice. Let $\Theta := \bigcup_{1 \leq k \leq M} \Theta_k$ be the set of
parameters, with an $\eta > 0$ such that for all $k$:

$$\Theta_k := \{\theta = (\beta, a_1, \dots, a_k, b_1, \dots, b_k, w_{11}, \dots, w_{1d}, \dots, w_{kd}),\ \forall\, 1 \leq i \leq k,\ \|w_i\| \geq \eta\}$$

The set $\Theta$ is compact as a finite union of compact sets. We denote by
$k^0$ the true number of hidden units or, equivalently, the minimal number
of hidden units such that $F_{\theta^0}$, with $\theta^0 \in \Theta_{k^0}$,
realizes the true regression function. The function
$f(z) := f_{\theta^0}(z)$ is then the true density of the observation.
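As a small illustration of why the Gaussian likelihood and the least-squares criterion rank parameters identically when $\sigma^2$ is fixed, the sketch below computes the log-likelihood with the term depending on $q(x)$ dropped, since that term does not depend on the parameter. The predictions `pred_1` and `pred_2` are hypothetical stand-ins for the outputs of two candidate parameter vectors.

```python
import numpy as np

def log_likelihood(y, pred, sigma2):
    """Gaussian log-likelihood of y given predictions, dropping the sum of log q(x_i)."""
    n = y.size
    sse = np.sum((y - pred) ** 2)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - sse / (2 * sigma2)

rng = np.random.default_rng(1)
y = rng.normal(size=50)
pred_1 = np.zeros(50)          # hypothetical predictions of a first candidate
pred_2 = np.full(50, 0.3)      # hypothetical predictions of a second candidate
sigma2 = 1.0

sse_1 = np.sum((y - pred_1) ** 2)
sse_2 = np.sum((y - pred_2) ** 2)
ll_1 = log_likelihood(y, pred_1, sigma2)
ll_2 = log_likelihood(y, pred_2, sigma2)

# The log-likelihood gap equals the squared-error gap scaled by -1 / (2 sigma^2).
gap_matches = np.isclose(ll_1 - ll_2, -(sse_1 - sse_2) / (2 * sigma2))
```

Maximizing the log-likelihood over a set of parameters is therefore the same as minimizing the sum of squared errors over that set.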



3 Identification of the architecture of the MLP 

Let $l_n(\theta) := \sum_{i=1}^{n} \log(f_\theta(z_i))$ be the log-likelihood
of the model; note that this function is known up to the constant
$\sum_{i=1}^{n} \log(q(x_i))$, which is independent of the parameter
$\theta$. We define $\hat{k}$, the maximum penalized likelihood estimator,
as the number of hidden units maximizing:

$$T_n(k) := \max\{l_n(\theta) : \theta \in \Theta_k\} - p_n(k) \qquad (2)$$

where $p_n(k)$ is a term which penalizes the log-likelihood as a function of
the number of hidden units of the model. In the sequel, we assume the
following properties:

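H-1 : The MLP functions are identifiable in the following weak sense:

$$\forall x,\ \beta' + \sum_{i=1}^{k} a'_i \phi(b'_i + {w'_i}^T x) = \beta + \sum_{i=1}^{k} a_i \phi(b_i + w_i^T x) \iff \delta_{\beta'} + \sum_{i=1}^{k} a'_i \delta_{(b'_i, w'_i)} = \delta_{\beta} + \sum_{i=1}^{k} a_i \delta_{(b_i, w_i)}$$

where $\delta$ is the Dirac measure, i.e. $\delta_\theta(x) = 1$ if
$x = \theta$ and $\delta_\theta(x) = 0$ if $x \neq \theta$.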



H-2 : $E(|X|^6) < \infty$.



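H-3 : The functions of the set

$$\Big( \big(x_k x_l \phi''(b_i^0 + w_i^{0T} x)\big)_{1 \leq l \leq k \leq d,\ 1 \leq i \leq k^0},\ \big(\phi''(b_i^0 + w_i^{0T} x)\big)_{1 \leq i \leq k^0},\ \big(x_k \phi'(b_i^0 + w_i^{0T} x)\big)_{1 \leq k \leq d,\ 1 \leq i \leq k^0},\ \big(\phi'(b_i^0 + w_i^{0T} x)\big)_{1 \leq i \leq k^0} \Big)$$

are linearly independent in the Hilbert space $L^2(q\lambda_d)$.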

H-4 : $p_n(\cdot)$ is increasing, $p_n(k_1) - p_n(k_2) \to \infty$ for all
$k_1 > k_2$, and $\lim_{n \to \infty} \frac{p_n(k)}{n} = 0$. Note that such
conditions are satisfied by BIC-like criteria.
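The selection rule (2) can be sketched in a few lines once the maximized log-likelihood is available for each candidate $k$. The sketch below assumes the standard BIC form of the penalty, half the number of parameters times $\log(n)$, with the parameter count $2k + 1 + k \times d$ from Section 2.1. The maximized log-likelihood values are hypothetical numbers chosen for illustration, not results from the paper.

```python
import math

def n_params(k, d):
    """Dimension of theta for k hidden units and d inputs: 2k + 1 + k*d."""
    return 2 * k + 1 + k * d

def bic_penalty(k, d, n):
    """BIC-type penalty p_n(k) = (number of parameters / 2) * log(n)."""
    return 0.5 * n_params(k, d) * math.log(n)

def select_k(max_loglik, d, n):
    """Return the k maximizing T_n(k) = max log-likelihood - p_n(k).

    max_loglik maps each candidate k to the maximized log-likelihood over Theta_k.
    """
    return max(max_loglik, key=lambda k: max_loglik[k] - bic_penalty(k, d, n))

# Hypothetical maximized log-likelihoods: large gain up to k = 2, negligible gain after.
max_loglik = {1: -820.0, 2: -700.0, 3: -699.2, 4: -698.9}
k_hat = select_k(max_loglik, d=3, n=500)
```

With this penalty, the difference $p_n(k_1) - p_n(k_2)$ grows like $\log(n)$ while $p_n(k)/n$ vanishes, which is the behavior H-4 asks for.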

We then get the following result:

Theorem 3.1 If the assumptions H-1, H-2, H-3 and H-4 hold, then $\hat{k} \to k^0$ in probability.

Remark Sussmann [7] has shown that, if the transfer functions $\phi$ are
sigmoids and if the parameters $b_i$ are positive (in order to avoid a
symmetry on the signs of $(b_i, w_i)$ and $a_i$), then assumption (H-1) is
satisfied. Moreover, following a reasoning similar to Fukumizu [5], we can
show that the sigmoid functions satisfy assumption (H-3). So, this result can
be applied to the one-hidden-layer MLP model with sigmoidal transfer
functions.



Sketch of the proof Consider the functions:

$$s_\theta(z) := \frac{\frac{f_\theta}{f}(z) - 1}{\left\| \frac{f_\theta}{f} - 1 \right\|_2}, \quad \text{where } \|\cdot\|_2 \text{ is the norm of } L^2(f\lambda_{d+1})$$

In order to prove the theorem, we only have to show that the set
$\mathcal{S} := \{s_\theta, \theta \in \Theta\}$ is a Donsker class (cf. van
der Vaart [5]). Roughly speaking, a Donsker class is a set of functions for
which the empirical process (with i.i.d. variables) satisfies a uniform
central limit theorem, with a Gaussian process as limit distribution. The
result then follows from Theorem 2.1 of Gassiat [3]. First, we derive an
asymptotic expansion of the likelihood ratio when the model is
overparameterized. The Donsker property follows from this expansion.



3.1 Reparameterization of the model 

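We reparameterize the model using the same method as Liu and Shao [6] for
mixture models. If $\frac{f_\theta}{f} - 1 = 0$, we have $\beta = \beta^0$
and there exists a vector $t = (t_i)_{1 \leq i \leq k^0}$ such that
$0 = t_0 < t_1 < \dots < t_{k^0} \leq k$ and, up to a permutation:
$b_{t_{i-1}+1} = \dots = b_{t_i} = b_i^0$,
$w_{t_{i-1}+1} = \dots = w_{t_i} = w_i^0$,
$\sum_{j=t_{i-1}+1}^{t_i} a_j = a_i^0$, and $a_j = 0$ for
$t_{k^0} + 1 \leq j \leq k$. Let
$s_i = \sum_{j=t_{i-1}+1}^{t_i} a_j - a_i^0$ and
$q_j = a_j / \sum_{l=t_{i-1}+1}^{t_i} a_l$. We then get the
reparameterization $\theta = (\Phi_t, \psi_t)$ with

$$\Phi_t = \left(\beta, (b_j)_{j=1}^{t_{k^0}}, (w_j)_{j=1}^{t_{k^0}}, (s_i)_{i=1}^{k^0}, (a_j)_{j=t_{k^0}+1}^{k}\right), \qquad \psi_t = \left((q_j)_{j=1}^{t_{k^0}}, (b_j)_{j=t_{k^0}+1}^{k}, (w_j)_{j=t_{k^0}+1}^{k}\right).$$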
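To make the change of variables concrete, here is a minimal sketch of the map from the output weights $a_j$ to the pair $(s_i, q_j)$ and back, for the units assigned to the $k^0$ true hidden units. The helper names `to_sq` and `from_sq`, the block boundaries and the numerical values are hypothetical. The sketch assumes each block of output weights has a nonzero sum, so that $q_j$ is well defined.

```python
import numpy as np

def to_sq(a, t, a0):
    """Map output weights a to (s, q) for block boundaries t = [0, t_1, ..., t_k0]."""
    s, q = [], []
    for i in range(len(a0)):
        block = a[t[i]:t[i + 1]]        # a_j for t_{i-1} < j <= t_i (0-based slice)
        total = block.sum()             # assumed nonzero
        s.append(total - a0[i])
        q.extend(block / total)
    return np.array(s), np.array(q)

def from_sq(s, q, t, a0):
    """Inverse map: a_j = q_j * (s_i + a_i^0) inside block i."""
    a = []
    for i in range(len(a0)):
        a.extend(q[t[i]:t[i + 1]] * (s[i] + a0[i]))
    return np.array(a)

a0 = np.array([1.5, -2.0])              # hypothetical true output weights, k0 = 2
t = [0, 2, 5]                           # units 1-2 form block 1, units 3-5 block 2
a = np.array([0.4, 1.2, -0.5, -1.0, -0.6])

s, q = to_sq(a, t, a0)
a_back = from_sq(s, q, t, a0)
```

Within each block the $q_j$ sum to one, so only $s_i$ carries information about how far the block's total output weight is from $a_i^0$; how that total is split among the duplicated units is left to the $q_j$.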



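With this parameterization, for a fixed $t$, $\Phi_t$ is an identifiable
parameter and all the non-identifiability of the model is carried by
$\psi_t$. Then $F_{(\Phi_t, \psi_t)}$ is equal to $F_{\theta^0}$ if and only
if

$$\Phi_t = \Phi_t^0 := (\beta^0, b_1^0, \dots, b_1^0, \dots, b_{k^0}^0, \dots, b_{k^0}^0, w_1^0, \dots, w_1^0, \dots, w_{k^0}^0, \dots, w_{k^0}^0, 0, \dots, 0, 0, \dots, 0)$$

and now we have:

$$\frac{f_\theta}{f}(z) = \frac{\exp\left(-\frac{1}{2\sigma^2}\left(y - \left(\beta + \sum_{i=1}^{k^0} (s_i + a_i^0) \sum_{j=t_{i-1}+1}^{t_i} q_j \phi(b_j + w_j^T x) + \sum_{j=t_{k^0}+1}^{k} a_j \phi(b_j + w_j^T x)\right)\right)^2\right)}{\exp\left(-\frac{1}{2\sigma^2}\left(y - \left(\beta^0 + \sum_{i=1}^{k^0} a_i^0 \phi(b_i^0 + w_i^{0T} x)\right)\right)^2\right)} \qquad (3)$$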



We then get an approximation of the likelihood ratio by differentiating
expression (3) with respect to each component of the parameter vector
$\Phi_t$, thanks to assumptions H-1, H-2 and H-3.
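As a worked step, the first derivative of the likelihood ratio can be obtained from the chain rule. The display below is a sketch under the model's assumptions (Gaussian noise with known $\sigma^2$, and $q(x)$ cancelling in the ratio); the notation $\partial / \partial \Phi$ for the derivative with respect to a generic component of the identifiable parameter is introduced here for illustration only.

```latex
\[
\frac{f_\theta}{f}(z)
  = \exp\!\Big(-\tfrac{1}{2\sigma^2}\big[(y - F_\theta(x))^2 - (y - F_{\theta^0}(x))^2\big]\Big),
\]
\[
\frac{\partial}{\partial \Phi}\,\frac{f_\theta}{f}(z)\Big|_{F_\theta = F_{\theta^0}}
  = \frac{1}{\sigma^2}\,\big(y - F_{\theta^0}(x)\big)\,
    \frac{\partial F_\theta(x)}{\partial \Phi}\Big|_{F_\theta = F_{\theta^0}} .
\]
```

At the true regression function the exponential factor equals one, so every first-order term is the corresponding derivative of the MLP function multiplied by the common factor $\frac{1}{\sigma^2}(y - F_{\theta^0}(x))$.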

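Lemma 3.2 Let us write
$D(\Phi_t, \psi_t) := \left\| \frac{f_{(\Phi_t, \psi_t)}}{f} - 1 \right\|_2$ and
$e(z) := \frac{1}{\sigma^2}\left(y - \left(\beta^0 + \sum_{i=1}^{k^0} a_i^0 \phi(b_i^0 + w_i^{0T} x)\right)\right)$;
we get the following approximation:

[...]

with

$$(\Phi_t - \Phi_t^0)^T f'_{(\Phi_t^0, \psi_t)}(z) = \Big(\beta - \beta^0 + \sum_{i=1}^{k^0} s_i \phi(b_i^0 + w_i^{0T} x) + \sum_{i=1}^{k^0} \sum_{j=t_{i-1}+1}^{t_i} q_j \big(b_j - b_i^0 + (w_j - w_i^0)^T x\big) a_i^0 \phi'(b_i^0 + w_i^{0T} x) + \sum_{j=t_{k^0}+1}^{k} a_j \phi(b_j + w_j^T x)\Big) e(z)$$

and

$$(\Phi_t - \Phi_t^0)^T f''_{(\Phi_t^0, \psi_t)}(z) (\Phi_t - \Phi_t^0) = \left(1 - \frac{1}{\sigma^2 e^2(z)}\right) \left((\Phi_t - \Phi_t^0)^T f'_{(\Phi_t^0, \psi_t)}(z) f'^{\,T}_{(\Phi_t^0, \psi_t)}(z) (\Phi_t - \Phi_t^0)\right) + \Big(\sum_{i=1}^{k^0} \sum_{j=t_{i-1}+1}^{t_i} q_j (w_j - w_i^0)^T x x^T (w_j - w_i^0) a_i^0 \phi''(b_i^0 + w_i^{0T} x) + \cdots\Big) e(z)$$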



Now, it is easy to show that the minimum number $N(\varepsilon)$ of
$\varepsilon$-brackets (cf. van der Vaart) [...]. This proves that
$\mathcal{S}$ is a Donsker class. ■

4 Conclusion 

Penalized likelihood criteria have been used for many years for MLP models.
However, even though such techniques seem to work in practice, there was no
theoretical justification for their use. Indeed, the classical asymptotic
theory fails when the Fisher information matrix is singular. For the first
time, we give sufficient conditions ensuring the success of such a selection
procedure as the number of observations tends to infinity. This result
supports the use of classical statistical criteria like the BIC to select
the architecture of MLP models.